Taste of Two Different Flavours: Which Manipuri Script works better for English-Manipuri Language pair SMT Systems?
Abstract
A statistical machine translation (SMT) system depends heavily on a sentence-aligned parallel corpus and the target language model. This paper points out some of the core issues of switching a language script and its repercussions for phrase-based statistical machine translation system development. The present task reports on the outcome of the English-Manipuri language pair phrase-based SMT task in two settings: a) Manipuri using Bengali script, b) Manipuri using transliterated Meetei Mayek script. Two independent views of the training data and language models, one for the Bengali script based SMT system and one for the transliterated Meetei Mayek based SMT system, are presented and compared. The impact of the various language models is notable in this scenario. The BLEU and NIST scores show that the Bengali script based phrase-based SMT (PBSMT) system outperforms the Meetei Mayek based English-to-Manipuri SMT system. However, subjective evaluation shows slight variation from the automatic scores.
Similar resources
Building Parallel Corpora for SMT System: A Case Study of English-Manipuri
Statistical Machine Translation (SMT) systems are developed using sentence-aligned parallel corpora. The difficulty is that no parallel corpus of the required size exists for many language pairs. Preparing a large-scale parallel corpus takes time and demands linguistic skill. In the present work, the various issues of a quality parallel corpus and a technique that extracts p...
Statistical Machine Translation of English-Manipuri using Morpho-syntactic and Semantic Information
The English-Manipuri language pair is rarely investigated and has restricted bilingual resources. The development of a factored Statistical Machine Translation (SMT) system between English as source and Manipuri, a morphologically rich language, as target is reported. The role of the suffixes and dependency relations on the source side and case markers on the target side are identified as im...
Manipuri Morpheme Identification
The morphemes of a Manipuri word are the real bottleneck for any Manipuri Natural Language Processing (NLP) work. Manipuri is one of the Indian Scheduled Languages, with little advancement so far in terms of NLP applications. This is because the language is highly agglutinative. Segmentation of a word and identifying the morphemes becomes necessary before proceeding for any of the...
Addressing some Issues of Data Sparsity towards Improving English-Manipuri SMT using Morphological Information
The performance of an SMT system depends heavily on the availability of large parallel corpora. The unavailability of these resources in the required amount for many language pairs is a challenging issue. The required size of the resource is considerably larger for SMT systems involving a morphologically rich and highly agglutinative language. This paper investigates some of the issues on en...
Semi-Automatic Parallel Corpora Extraction from Comparable News Corpora
The parallel corpus is a necessary resource in many multilingual and cross-lingual natural language processing applications, including Machine Translation and Cross Lingual Information Retrieval. Preparing a large-scale parallel corpus takes time and also demands linguistic skill. In the present work, a technique has been developed that extracts parallel corpus between Manipuri, a morphologicall...